ABSTRACT:

The limited range in its abscissa of ranked letter frequency distributions causes multiple functions to fit the
observed distribution reasonably well. In order to critically compare various functions, we apply statistical
model selection to ten functions, using the texts of U.S. and Mexican presidential speeches in the last 1-2
centuries. Despite minor switching of ranking order of certain letters during the temporal evolution for both
datasets, the letter usage is generally stable. The best fitting function, judged by either least-square-error or
by AIC/BIC model selection, is the Cocho/Beta function. We also use a novel method to discover clusters of
letters by their observed-over-expected frequency ratios.

1 Introduction

Although morphemes, not letters, are usually considered to be the smallest linguistic unit,
studying statistics of letter usage has its own merit. For example, information on letter
frequency is essential in cryptography for deciphering a substitution code (Friedman, 1976), and
"frequency analysis" was used as early as the 9th century by the Arab scientist al-Kindi for
the purpose of decryption (Mrayati et al., 2003).



An efficient design of a communication code also depends crucially on the letter frequency.
The shortest Morse codes are reserved for the most common letters: one dot for letter e
and one dash for letter t, both letters being the most frequent in English. The same principle
is also behind the design of the minimum-redundancy code by Huffman (Huffman, 1952).

The initial motivation for the "QWERTY" mechanical typewriter design was to keep the most
common letters far apart on the keyboard so that metal bars would not jam for a fast typist
(David, 1985). Even in modern times, the digraph (letter pair) frequency is an important
piece of information for keyboard design (Zhai et al., 1999).

In all these examples, a quantitative description of letter usage frequency is important.
Unlike the ranked word frequency distribution, which is well characterized by a simple power-law
function or Zipf's law (Zipf, 1935), it is not clear whether a universal fitting function exists,
despite a claim of such a function (the logarithmic function) in (Kanter and Kessler, 1995).

In this paper, we aim at critically examining various functional forms for fitting the
rank-frequency distribution of letters, ranging from simple ones to more complicated ones with two
or three free parameters. The dataset used is the historical U.S. and Mexican presidential
speeches. The presidential speeches are readily available (see another study where the Italian
presidential "end of year" addresses are used (Tuzzi et al., 2009)), and they also offer an
opportunity for investigations of temporal patterns in letter usage.

The ranked word frequency distributions studied by George Zipf have extremely long tails,
due to the presence of low-frequency words (such as hapax legomena). As a result, logarithmic
transformation is usually applied to the x-axis (as well as the y-axis). The double logarithmic
transformation is also justified by the expectation of a power-law function, as it will lead to a
linear regression. This linear fitting in log-log scale may have its pitfalls (Clauset et al., 2009),
one being the uneven distribution of points along the log-transformed x-axis.

For ranked letter frequency data, the finite number of letters in the alphabet sets an upper bound for
the rank, and there is no large number of rare events, which is an important theoretical issue in
modeling the word rank-frequency distribution (Baayen, 2001). On the other hand, the limited
range of abscissa may make it hard to distinguish different fitting functions. Since the power-law
function is not expected to be the best fitting function, double logarithmic transformation is
not necessary, and we will fit the data in linear-linear scale. Because the fittings are no longer
linear, the curve fitting is carried out by nonlinear least-square (Bates and Watts, 1988).



A statistical model with a larger number of free parameters is guaranteed to fit the data
at least as well as a sub-model with fewer parameters. To compare the performance
of models with different numbers of parameters, a penalty should be imposed on the extra
parameters. Towards this end, we apply the standard model selection technique
with the Akaike Information Criterion or AIC (Akaike, 1974; Burnham and Anderson, 2002) and
the Bayesian Information Criterion or BIC (Schwarz, 1978) to compare various functions used to
fit the ranked letter frequency data.



2 Data 

US presidential inaugural speeches: In order to take into account any possible letter
usage trend in time, we use the US Presidential Inaugural Speech texts for the 44 presidents in
the last 200 years. The data is downloaded from The American Presidency Project at the
University of California at Santa Barbara site (http://www.presidency.ucsb.edu/). Multiple
inaugural speeches from the same person are combined into one, including the nonconsecutive
presidencies of Grover Cleveland. Five presidents did not give an inaugural speech (John Tyler,
Millard Fillmore, Andrew Johnson, Chester Arthur, Gerald Ford). The final dataset consists
of 38 text files.

Mexico presidential addresses to the congress: For Spanish texts, we selected the 19
Mexican presidents' reports to congress (Informes Presidenciales) from 1914 to 2006
(http://www.diputados.gob.mx/cedia/sia/reJnfo.htm). Again, addresses by the same president
are combined into one text file. Some presidential texts are much shorter than others for
one of two possible reasons: either the president presented only one address (the typical number
of addresses is 6), such as Adolfo de la Huerta (1920) and Emilio Portes Gil (1929), or the
president gave shorter reports, such as Ernesto Zedillo Ponce de Leon (1995-2000) and Vicente
Fox Quesada (2001-2006).



3 Letter frequencies and their temporal trends 

Fig. 1(A) shows the English letter frequency of the 38 US presidents' speeches, separated by
century. The letter e remains the most commonly used English letter with little change in
its frequency. However, there seems to be a trend of less usage of letter t, and more usage of
letter w, in the 20th century as compared to the 19th century.





Figure 1: English (A) and Spanish (B) letter frequencies (unranked, in alphabetic order) for 38 U.S. presidential 
inaugural speeches and 19 Mexican presidents' report to congress. Letter frequencies of each president's speech 
are linked by a line, and different time periods are drawn separately (U.S. president speech: 1789-1800, shifting 
1801-1901 by 0.02, shifting 1905-now by 0.04; Mexico president speech: 1919-1934, shifting 1935-1964 by 0.02, 
shifting 1965-2005 by 0.04). Due to a larger sample size, the fluctuation of frequency from president to president 
in Spanish texts is much smaller than that in English texts. 



Similar frequencies of Spanish letters in the 19 Mexican presidents' addresses are shown
in Fig. 1(B). Letters with accent (the acute accent for the vowels, the umlaut for the letter
ü and the tilde for the ñ) are counted separately but later combined because they do not
really represent different letters. The 19 files are arbitrarily split into three groups: the first
7 presidents (from 1917 to 1934), the next 5 presidents (from 1935 to 1964), and the last 7
presidents (from 1965 to 2006). These three groups are drawn separately in Fig. 1(B). The
narrowing of the variations of letter frequency in Fig. 1(B) as compared to Fig. 1(A) is due to
the larger sample sizes in Spanish texts.

In Table 1, English letters are sorted by their frequency of usage, from common to rare, for
the 38 US president speeches. Again, e and t are consistently ranked as number 1 and 2 (with
the exception of Clinton's speech, where o is ranked second), but the ranking order of a and i
seems to change with time: in older speeches (e.g., before 1890), i is ranked higher than
a; for the next 10 or so presidents, i and a were used about equally; and the order is reversed
for newer speeches (e.g., after 1960).

Table 2 shows the corresponding sorting of Spanish letters in the 19 Mexican presidents'
addresses. The sequence eaosinr consistently appears at the head of the string. However, the
order of d and l has been switched from dl in the first half of the 20th century (until president
Rodríguez, whose term ended in 1934) to ld in the second half of the century (since president
Aleman, whose term started in 1946).

To confirm the observation from Fig. 1 and Tables 1-2, in Fig. 2 we directly plot the English
letter frequencies of t, w, a, i and the Spanish letter frequencies of d, l, m. Indeed, there is higher
usage of w and lesser usage of t in recent US president speeches, and the relative order of a
and i was switching from year 1889 to 1957. For Mexican president addresses, the letter l
overtakes d in the last few decades. There is also an upward trend for the usage of the Spanish
letter m.

Despite these interesting trends of a few letters over the last two hundred years for English
and one hundred years for Spanish, the overall letter frequencies remain more or less stable.
We combine all 38 English files into one (and 19 Spanish files into one) to examine the rank
frequency distribution.
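
The counting and ranking step behind Tables 1 and 2 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name ranked_letter_frequencies, the default 26-letter alphabet, the alphabetical tie-breaking rule, and the sample sentence are ours and are not taken from the original analysis.

```python
from collections import Counter
import string


def ranked_letter_frequencies(text, alphabet=string.ascii_lowercase):
    """Return (letter, normalized frequency) pairs, most frequent first.

    Hypothetical helper: characters outside `alphabet` (spaces, digits,
    punctuation) are ignored, and upper case is folded to lower case.
    """
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no letters from the alphabet")
    # Letters absent from the text get frequency 0, so every rank 1..n exists.
    freqs = {ch: counts.get(ch, 0) / total for ch in alphabet}
    # Sort by decreasing frequency; ties are broken alphabetically (our choice).
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))


sample = "We the people of the United States"
ranked = ranked_letter_frequencies(sample)
# The concatenated letters play the role of one "sorting" string in Table 1.
print("".join(letter for letter, _ in ranked))
```

For a Spanish text one would pass a 27-letter alphabet that includes ñ and fold accented vowels to their base letters before counting.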
No.  Name           Year(s) of speech     Letters sorted by frequency   Letter count
1    Washington     1789,1793             etinoashrcdlumfpybwgvxjqkz    7710
2    Adams          1797                  etnioasrhdlcfumpgybvwxjkqz    11281
3    Jefferson      1801,1805             etoinasrhldcufmpwgybvkxjzq    18701
4    Madison        1809,1813             etoinasrhldcufpmgwbyvkxjzq    11572
5    Monroe         1817,1821             etinoarshdclufpmwygbvxkjzq    37522
6    Adams          1825                  etoinasrhdlcfupmgybvwxjqkz    14572
7    Jackson        1829,1833             etoinarshldcufmpybgwvxkjqz    11372
8    Van Buren      1837                  etoinasrhlducfpmywgvbxkjqz    19215
9    Harrison       1841                  etoinarshcdlfumpygbwvxkjzq    40526
10   Polk           1845                  etoniasrhdlcufmpygbwvxjqkz    23475
11   Taylor         1849                  etoinasrhlcdufmpybgwvxkjzq    5413
12   Pierce         1853                  etinoarshlducfmpygwbvkxjqz    16406
13   Buchanan       1857                  etionasrhlcdufpmywgvbxqjkz    13696
14   Lincoln        1861,1865             etoinasrhldcufpmywbgvkxjqz    19340
15   Grant          1869,1873             etoinarshldcufmpgywbvxkqjz    11476
16   Hayes          1877                  etoinasrhlcdufpmygbwvjqxkz    12171
17   Garfield       1881                  etoniasrhlducfmpgwybvkxjqz    14477
18   Cleveland      1885                  etoinasrhdlcufpmygbwvxkzjq    18480
19   Harrison       1889                  etoniasrhldcufpmwygbvkxjqz    21394
20   McKinley       1897,1901             etnoiarshlducfpmygbwvxkjzq    30179
21   T.Roosevelt    1905                  etoainrshldufwcgbpmvykxjzq    4480
22   Taft           1909                  etoinasrhdclfumpgywbvkxjqz    26272
23   Wilson         1913,1917             etoanisrhdlucfwpmgyvbkjqxz    14360
24   Harding        1921                  etnioarsldhcufmwpgybvkxzjq    16508
25   Coolidge       1925                  etonairshldcufmpwybgvxkjqz    19482
26   Hoover         1929                  etoinarshldcufmpgywbvzxjkq    19256
27   F.D.Roosevelt  1933,1937,1941,1945   etoainrshldcfumpwygvbkjxzq    25696
28   Truman         1949                  etoainrshldcfumpwgyvbkjqxz    11070
29   Eisenhower     1953,1957             etoainrshldfcumpwygbvkqjxz    18313
30   Kennedy        1961                  etoanrsihldfuwcmgypbvkjxzq    6003
31   Johnson        1965                  etanoirshdluwcfmgybpvkjxzq    6468
32   Nixon          1969,1973             etoanirshldcuwfmpgbyvkjqxz    17142
33   Carter         1977                  etaonirshldumwcfpgbyvkjqxz    5459
34   Reagan         1981,1985             etonarishdlumwcfgpybvkjxzq    22494
35   G.H.W.Bush     1989                  etaonrishdluwcmgfybpvkzjxq    9781
36   Clinton        1993,1997             eotanrishldcumwfpgybvkjzxq    16915
37   G.W.Bush       2001,2005             etonairsdhlcufmwygpbvkjzqx    16759
38   Obama          2009                  etoanrsihdlucwfmgypbvkjqxz    10632

Table 1: The names of the 38 U.S. presidents, the years of their inaugural speech, the order of letters ranked 
by their frequency in the corresponding president's speech, and the total counts of letters. 
No.  Name            Years of speech                 Letters sorted by frequency   Letter count
1    Carranza        1917,1918,1919                  eaosnirdlctupmbgyvfqhjxzñkw   539107
2    De la Huerta    1920                            eaosinrdlctupmbgfvyhqjzxñkw   113057
3    Obregon         1921,1922,1923,1924             eaosinrdlctupmbgyfvhqjxzñkw   675552
4    Elías           1925,1926,1927,1928             eaosinrdlctupmbgyfvqhjzxñkw   700715
5    Portes Gil      1929                            eaosinrdlctupmbgyvfqhjzxñkw   231873
6    Ortiz           1930,1931,1932                  eaoisnrdlctupmbgvfyqhjzxñkw   664319
7    Rodríguez       1933,1934                       eaoisnrdlctupmbgyvfqhjzxñkw   301745
8    Cardenas        1935,1936,1937,1938,1939,1940   eaosinrldctupmbgvyfqhjxzñkw   402748
9    Avila           1941,1942,1943,1944,1945,1946   eaosinrlcdtumpbygvfqhjzxñkw   734540
10   Aleman          1947,1948,1949,1950,1951,1952   eaoisnrcltdumpbyvgfhqzjxñkw   549980
11   Ruiz            1953,1954,1955,1956,1957,1958   eaosinrldctumpbygvfqhjzxñkw   592550
12   Lopez           1959,1960,1961,1962,1963,1964   eaosinrldctupmbgvyfhqzjxñkw   712056
13   Diaz            1965,1966,1967,1968,1969,1970   eaosinrldctupmbgvyfqhzjxñkw   785528
14   Echeverría      1971,1972,1973,1974,1975,1976   eaosinrldctumpbvgfyqhjzxñkw   792338
15   Lopez Portillo  1977,1978,1979,1980,1981,1982   eaosinrlcdtumpbygvfqhzjxñkw   684658
16   De la Madrid    1983,1984,1985,1986,1987,1988   eaoisnrlcdtumpbyvgfhqzjxñkw   761274
17   Salinas         1989,1990,1991,1992,1993,1994   eaosinrlcdtumpbvygfhqzxjñkw   624933
18   Zedillo         1995,1996,1997,1998,1999,2000   eaosinrlcdtumpbgyvfqhzjxñkw   282463
19   Fox             2001,2002,2003,2004,2005        eaosinrldctumpbgyvfqhzjxñkw   311429

Table 2: The last names of the 19 Mexican presidents, the years when they addressed the congress, the order 
of letters ranked by their frequency in the corresponding president's address, and the total counts of letters. 

Figure 2: Temporal change of frequency in selected letters. (A) Letters t, i, a, w in 38 U.S. presidential
speeches. (B) Letters d, l, m in 19 Mexican presidential speeches.

4 Fitting ranked letter frequency distributions

We used ten different functions to fit the ranked letter frequency distribution in US presidential
inaugural speeches, averaged over all 38 presidents, and in Mexican presidential
addresses to the congress, averaged over 19 presidents. Here is a list of these functions (f
denotes the normalized letter frequency, r denotes the rank: r = 1 for the most frequent letter
and r = 26 (or 27) for the rarest letter, and n = 26 or 27 is the maximum rank value):

\begin{align}
\text{Gusein-Zade:} \quad & f = C \log\frac{n+1}{r} \tag{1} \\
\text{power-law:} \quad & f = \frac{C}{r^{a}} \tag{2} \\
\text{exponential:} \quad & f = C e^{-ar} \tag{3} \\
\text{logarithmic:} \quad & f = C - a \log(r) \tag{4} \\
\text{Weibull:} \quad & f = C \left( \log\frac{n+1}{r} \right)^{a} \tag{5} \\
\text{quadratic logarithmic:} \quad & f = C - a \log(r) - b \left( \log(r) \right)^{2} \tag{6} \\
\text{Yule:} \quad & f = C \frac{b^{r}}{r^{a}} \tag{7} \\
\text{Menzerath-Altmann/Inverse-Gamma:} \quad & f = C \frac{e^{-b/r}}{r^{a}} \tag{8} \\
\text{Cocho/Beta:} \quad & f = C \frac{(n+1-r)^{b}}{r^{a}} \tag{9} \\
\text{Frappat:} \quad & f = C + br + c e^{-ar} \tag{10}
\end{align}
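
For readers who prefer code to formulas, the ten functions of Eqs.(1-10) can be written as plain Python functions. This is a sketch under assumptions: the function names and argument order are ours, and the signs and exponents follow our reading of the equations (in particular the Weibull and Menzerath-Altmann forms), which should be checked against the cited references before serious use.

```python
import math


def gusein_zade(r, n, C):            # Eq. (1)
    return C * math.log((n + 1) / r)


def power_law(r, C, a):              # Eq. (2)
    return C / r ** a


def exponential(r, C, a):            # Eq. (3)
    return C * math.exp(-a * r)


def logarithmic(r, C, a):            # Eq. (4)
    return C - a * math.log(r)


def weibull(r, n, C, a):             # Eq. (5)
    return C * math.log((n + 1) / r) ** a


def quadratic_log(r, C, a, b):       # Eq. (6)
    return C - a * math.log(r) - b * math.log(r) ** 2


def yule(r, C, a, b):                # Eq. (7)
    return C * b ** r / r ** a


def menzerath_altmann(r, C, a, b):   # Eq. (8)
    return C * math.exp(-b / r) / r ** a


def cocho_beta(r, n, C, a, b):       # Eq. (9)
    return C * (n + 1 - r) ** b / r ** a


def frappat(r, C, a, b, c):          # Eq. (10)
    return C + b * r + c * math.exp(-a * r)


# Ratio of the most to the least frequent letter predicted by Cocho/Beta
# for n = 26; the constant C cancels in the ratio.
n = 26
ratio = cocho_beta(1, n, 1.0, 0.210, 1.35) / cocho_beta(n, n, 1.0, 0.210, 1.35)
print(ratio)
```

Setting a = 1 in weibull reproduces gusein_zade, which is a quick way to see that Eq.(5) extends Eq.(1).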
Since f is the normalized frequency, $\sum_{i=1}^{n} f_i = 1$, which adds a constraint on one parameter.
The parameter under constraint is labeled as C, whose value is generally of no interest to us.
Besides C, the number of free (adjustable) parameters in these fitting functions ranges from 0
(Gusein-Zade) to 3 (Frappat). The power-law, exponential, logarithmic, and Weibull functions
have 1 free parameter; the quadratic logarithmic, Yule, Menzerath-Altmann/Inverse-Gamma, and
Cocho/Beta functions have 2 free parameters, as discussed in (Li et al., 2010).

The power-law (Eq.(2)) and exponential function (Eq.(3)) are often the first group of functions
to be tested, due to their simplicity and widespread applicability. The zero-free-parameter
function (Gusein-Zade) in Eq.(1) (Gusein-Zade, 1987, 1988; Borodovsky and Gusein-Zade, 1989)
actually corresponds to the exponential cumulative distribution, and the Weibull function
(Eq.(5)) (Nabeshima and Gunji, 2004) corresponds to the stretched exponential cumulative
distribution. The conversion from cumulative distribution to rank distribution of these two
functions is discussed in detail in (Li et al., 2010).



The logarithmic function (Eq.(4)) is an extension of the Gusein-Zade function C log(n + 1)
− C log(r), obtained by allowing the coefficient of the log(r) term to be independently fitted. The
quadratic logarithmic function is in turn an extension of the logarithmic function, adding one extra
term. The logarithmic function is mentioned in (Kanter and Kessler, 1995; Vlad et al., 2000),
whereas the quadratic logarithmic function has not been used before, to the best of our knowledge.



The three two-parameter functions used are all attempts to modify the power-law function:
the Yule function (Yule, 1925; Martindale et al., 1996) uses an exponential function ($b^{r}$), the
Menzerath-Altmann or inverse-Gamma function (Altmann, 1980) uses an exponential function
of the inverse of rank ($e^{-b/r}$), and the Cocho or Beta function (Mansilla et al., 2007;
Naumis and Cocho, 2008; Martínez-Mekler et al., 2009) uses a power-law function of the reverse
rank ($(n+1-r)^{b}$).



The 3-parameter function in Eq.(10), proposed in (Frappat et al., 2003; Frappat and Sciarrino,
2006), adds a linear trend to the exponential function.



All x and y relationships in Eqs.(1-10) are non-linear. It is possible to transform variables
or introduce new variables to carry out the fitting by multiple linear regression. For example,
after defining $y' = \log(f)$, $x'_1 = \log(r)$, $x'_2 = \log(n+1-r)$, the Cocho/Beta function is equivalent
to a multiple regression $y' = c_0 + c_1 x'_1 + c_2 x'_2$, where the regression coefficients can be converted
back to the parameters used in Eq.(9): $C = e^{c_0}$, $a = -c_1$, $b = c_2$.
The data-fitting result in the transformed variable, however, is generally not identical to the
result in its original nonlinear form. Our method is to first use the multiple linear regression
in the transformed version, if possible, in order to obtain a rough estimation of the parameter
values. Then these values are used as the initial condition for nonlinear least-square iteration
using the nls function (Bates and Watts, 1988) in R (http://www.r-project.org/).
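
The first stage of this two-stage procedure, the linearized regression for the Cocho/Beta function, can be sketched as follows. This is not the R code used for the paper: it is a minimal pure-Python illustration in which the helper names (solve_linear_system, cocho_beta_initial_guess) and the synthetic test data are our own assumptions. The nonlinear refinement step is not shown; any nonlinear least-squares routine could start from the returned values.

```python
import math


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(m):
        pivot = max(range(col, m), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, m):
            factor = M[i][col] / M[col][col]
            for j in range(col, m + 1):
                M[i][j] -= factor * M[col][j]
    x = [0.0] * m
    for i in reversed(range(m)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, m))
        x[i] = (M[i][m] - s) / M[i][i]
    return x


def cocho_beta_initial_guess(freqs):
    """Rough (C, a, b) from the regression log f = c0 + c1 log r + c2 log(n+1-r).

    freqs[i] is the frequency at rank i + 1; all values must be positive.
    """
    n = len(freqs)
    rows = [(1.0, math.log(r), math.log(n + 1 - r)) for r in range(1, n + 1)]
    y = [math.log(f) for f in freqs]
    # Normal equations (X^T X) c = X^T y for the three regression coefficients.
    XtX = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
           for i in range(3)]
    Xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(3)]
    c0, c1, c2 = solve_linear_system(XtX, Xty)
    return math.exp(c0), -c1, c2  # C = e^{c0}, a = -c1, b = c2


# Synthetic check: data generated exactly from a Cocho/Beta curve.
n, a_true, b_true = 26, 0.21, 1.35
raw = [(n + 1 - r) ** b_true / r ** a_true for r in range(1, n + 1)]
C_true = 1.0 / sum(raw)          # normalization so that the frequencies sum to 1
freqs = [C_true * v for v in raw]
C_hat, a_hat, b_hat = cocho_beta_initial_guess(freqs)
print(C_hat, a_hat, b_hat)
```

On noise-free synthetic data the linearized fit recovers the generating parameters; on real data it minimizes errors in log(f) rather than in f, which is why it serves only as a starting point.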
Figure 3: Fitting ranked English letter frequency of U.S. presidential speech by ten different functions: (A)
power-law (a = 0.616) and exponential function (a = 0.118); (B) Gusein function (C = 0.0374); (C) logarithmic
function (a = 0.0401); (D) Weibull function (a = 0.935); (E) quadratic logarithmic function (a = 0.0280,
b = 0.00325); (F) Yule function (a = 0.0543, b = 0.897); (G) Menzerath-Altmann/Inverse-Gamma function
(a = -1.05, b = -1.31); (H) Cocho/Beta function (a = 0.210, b = 1.35); (I) Frappat function (a = 0.245,
b = -0.00242, c = 0.0813). The fitting performance measured by SSE and AIC/BIC is shown in Table 3.



Fig. 3 shows the nonlinear least-square fitting of English letter ranked frequencies with all
ten functions in Eqs.(1-10), and Fig. 4 shows the result for Spanish ranked letter frequencies.
Figure 4: Fitting ranked Spanish letter frequency of Mexican presidents' speech to congress by ten different
functions: (A) power-law (a = 0.653) and exponential function (a = 0.130); (B) Gusein function (C = 0.0303);
(C) logarithmic function (a = 0.0443); (D) Weibull function (a = 1.05); (E) quadratic logarithmic function
(a = 0.0306, b = 0.00362); (F) Yule function (a = -0.0333, b = 0.873); (G) Menzerath-Altmann/Inverse-
Gamma function (a = -1.22, b = -1.69); (H) Cocho/Beta function (a = 0.115, b = 2.04); (I) Frappat function
(a = 0.0592, b = 0.00315, c = 0.276). The fitting performance measured by SSE and AIC/BIC is shown in
Table 3.

The first impression of Figs. 3-4 is that all functions seem to fit the ranked letter frequency well,
with the exception of power-law and Menzerath-Altmann functions. Is it possible to further
distinguish those with even better fitting performance? That is the issue to be addressed in
the next section.

Figure 5: Fitting errors (residual, deviance), y(r) − f(r), of the ten functions used in Fig. 3 for U.S. presidential
speeches.

5 Comparison of the fitting performance 

How well a function f fits the data can be measured by the sum of squared errors (residuals)
SSE:

\[ \mathrm{SSE} = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^{2} \qquad (11) \]

where the parameters of the function are estimated by the least-square or maximum likelihood
method. It is not correct to compare two functions with different numbers of parameters by SSE
alone, as the function with more parameters has more freedom to adjust in order to achieve a higher
fitting performance. In the extreme example, a function with an unlimited number of parameters
can fit a finite dataset perfectly: this overfitting situation is called saturation.

To compare two functions with different numbers of parameters, the Akaike Information
Criterion (AIC) (Akaike, 1974) and Bayesian Information Criterion (BIC) (Schwarz, 1978) can
be used for model selection. Both criteria discount the (log) maximum likelihood of the
fitting model by a term proportional to the number of parameters (p): AIC uses the term
2p, and BIC uses the term log(n)p (where n is the sample size). Maximizing the discounted
maximum likelihood is our criterion for the best model (equivalent to minimizing AIC or BIC)
(Burnham and Anderson, 2002).

Figure 6: Fitting errors (residual, deviance), y(r) − f(r), of the ten functions used in Fig. 4 for Mexican
presidents' speech to congress.



In regression models (linear or nonlinear), there is a simple relationship between AIC/BIC
and SSE if we assume the variance of errors is unknown (and has to be estimated from the
data), and if we assume the variance of the error is the same for all data points (details are in
the Appendix).

Table 3 shows the AIC model selection result for the fitting in Fig. 3 and Fig. 4. The best
function for both English and Spanish, selected by either AIC or BIC, is the Cocho/Beta
function (Fig. 3(H), Fig. 4(H)). The second best function is the quadratic logarithmic function
(Fig. 3(E), Fig. 4(E)). For English text, these functions are followed by Weibull, logarithmic,
and Frappat functions. For Spanish texts, the two best functions are followed by Frappat,
logarithmic, and exponential functions.

A single SSE value does not tell us whether there exist systematic deviations (e.g., larger
deviations at high rank numbers). To address this question, Fig. 5 and Fig. 6 show the deviation
at any rank number for all fitting functions, for English and Spanish respectively. It is
interesting that functions with better fitting performance all have a similar pattern in rank-specific
deviation.

                                             English                    Spanish
function                         Eq.  P   SSE       ΔAIC   ΔBIC     SSE       ΔAIC   ΔBIC
Gusein-Zade                      1    0   0.00106   20.2   17.7     0.00670   57.3   54.8
power-law                        2    1   0.00461   60.3   59.0     0.00721   61.3   60.0
exponential                      3    1   0.000814  15.2   14.0     0.00118   12.5   11.2
logarithmic                      4    1   0.000635  8.75   7.49     0.00115   11.7   10.4
Weibull                          5    2   0.000559  7.45   7.45     0.00136   18.2   18.2
quadratic log                    6    2   0.000460  2.40   2.40     0.000915  7.59   7.59
Yule                             7    2   0.000788  16.4   16.4     0.00117   14.3   14.3
Menzerath-Altmann/Inverse-Gamma  8    2   0.00251   46.5   46.5     0.00340   43.0   43.0
Cocho/Beta                       9    2   0.000420  0      0        0.000691  0      0
Frappat                          10   3   0.000587  10.7   12.0     0.000838  7.20   8.49

Table 3: Regression diagnosis and model selection of ten functions on English and Spanish letter rank-frequency
plots.



6 Piecewise functions 



The zero-parameter Gusein-Zade function corresponds to a simple exponential cumulative
distribution (CD) (for more discussions, see (Li et al., 2010)):

\[ \mathrm{CD} = \frac{r}{n+1} = e^{-f/C} \qquad (12) \]

In other words, the proportion of values that are larger than f is equal to $e^{-f/C}$. Since the
Gusein-Zade function (Eq.(1)) can also be written as $C = f/\log[(n+1)/r]$, if we plot $f_i/\log[(n+1)/r_i]$
against $r_i$ (i = 1, 2, ..., n), this function predicts a plateau.
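
The plateau check amounts to dividing each observed frequency by the value log((n+1)/r) expected from Eq.(1) up to the constant C. A minimal sketch follows; the function name gusein_zade_ratio and the synthetic input are our own illustration, not the code used for the paper.

```python
import math


def gusein_zade_ratio(freqs):
    """Return f_r / log((n + 1) / r) for ranks r = 1..n.

    freqs[i] is the observed frequency at rank i + 1. If the data followed
    Eq.(1) exactly, every ratio would equal the same constant C.
    """
    n = len(freqs)
    return [f / math.log((n + 1) / r) for r, f in enumerate(freqs, start=1)]


# Synthetic data that follow Eq.(1) exactly with C = 0.04 and n = 26.
n, C = 26, 0.04
freqs = [C * math.log((n + 1) / r) for r in range(1, n + 1)]
ratios = gusein_zade_ratio(freqs)
print(min(ratios), max(ratios))
```

Applied to real ranked frequencies, groups of letters whose ratios sit at a common level different from the rest are the clusters sought in this section.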

Fig. 7 shows $f_i/\log[(n+1)/r_i]$ as a function of rank, for both English (black) and Spanish
(red) letters. Surprisingly, instead of a plateau, we see step functions. For English letters, the
top 21 letters (etoniarshldcufmpwygbv) form the first group, and the next 5 letters (kxjqz)
form the second one. The average plateau height of the first group in Fig. 7 is 0.0425, that of
the second group is 0.0157.

Figure 7: An alternative form of Gusein-Zade function is f/log((n+1)/r) = C, and its validity can be checked by
plotting f/log((n+1)/r) versus the rank r.

For Spanish letters in Fig. 7, three groups appear in a step function. The two rarest letters
(k, w) are very different from the others (average plateau height is 0.00165). This is a known fact,
as k and w are only used in foreign words. The top 14 letters (eaosnirldctump) are in one
group (average height of 0.0437), and the next 11 letters (bgyvfqhjzxñ) form the second group
(height is 0.0185). When the Spanish data is compared to the English data, it is interesting
that the plateau heights of the two groups are similar across the two languages, whereas the number
of letters in the lower plateau is much larger in Spanish than in English.

The result of Fig. 7 indicates that we may construct a piecewise Gusein-Zade function to fit
the ranked letter frequency distribution. It should be noted that the number of parameters in a
piecewise Gusein-Zade function is no longer zero. For a two-piece function, three parameters are
estimated: the plateau height of the first ($C_1$) and the second segment ($C_2 \neq C_1$), and the partition
position on the x-axis ($r_0$). This minus the normalization constraint leads to 2 free parameters.
This 3-parameter (2 of them are free) piecewise function can be written as:

\[ f = \begin{cases} C_1 \log\frac{n+1}{r} & \text{if } r < r_0 \\ C_2 \log\frac{n+1}{r} & \text{if } r \ge r_0 \end{cases} \qquad (13) \]

For the English letter data in Fig. 3, $r_0$ is chosen at 22; least square regression leads to $C_1$ =
0.04065 and $C_2$ = 0.01394, and SSE = 0.000578. For Spanish letters, with the 2-segment function
partitioned at $r_0$ = 15, $C_1$ = 0.0424 and $C_2$ = 0.01897, and SSE = 0.000539. Using a 3-segment
function, SSE is improved only slightly to 0.000537. These results are comparable to the best
SSE results obtained by the Beta function (Table 3).
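
A two-piece fit of this kind can be sketched as below. The sketch rests on assumptions that go beyond the text: each plateau height is estimated by an unconstrained least-squares regression through the origin on log((n+1)/r), the normalization constraint is ignored, and the partition is found by exhaustive search. The function name fit_two_piece and the synthetic data are hypothetical.

```python
import math


def fit_two_piece(freqs, r0):
    """Least-squares C1 (ranks < r0) and C2 (ranks >= r0), plus the SSE.

    freqs[i] is the frequency at rank i + 1; r0 must satisfy 2 <= r0 <= n.
    """
    n = len(freqs)
    x = [math.log((n + 1) / r) for r in range(1, n + 1)]

    def slope(indices):
        # Regression through the origin: C = sum(f * x) / sum(x * x).
        return (sum(freqs[i] * x[i] for i in indices)
                / sum(x[i] ** 2 for i in indices))

    first = range(0, r0 - 1)    # ranks 1 .. r0 - 1
    second = range(r0 - 1, n)   # ranks r0 .. n
    C1, C2 = slope(first), slope(second)
    sse = sum((freqs[i] - (C1 if i < r0 - 1 else C2) * x[i]) ** 2
              for i in range(n))
    return C1, C2, sse


# Synthetic data with a known break at rank 22 (illustrative values only).
n = 26
freqs = [(0.04 if r < 22 else 0.014) * math.log((n + 1) / r)
         for r in range(1, n + 1)]
# Try every admissible partition and keep the one with the smallest SSE.
best_r0 = min(range(2, n + 1), key=lambda r0: fit_two_piece(freqs, r0)[2])
C1_hat, C2_hat, sse_hat = fit_two_piece(freqs, best_r0)
print(best_r0, C1_hat, C2_hat, sse_hat)
```

When comparing such a fit with the functions in Table 3, the partition position should be counted as an estimated parameter, as noted above.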

7 Discussion 

So far we have not considered space as a "letter". The number of spaces is simply equal to
the number of words ($N_{space} = N_{word}$), and the space frequency is
$p_{space} = N_{space}/(N_{space} + N_{letter})$. For the US presidential speeches, the averaged $p_{space}$ is 0.174.
For Mexican presidential speeches, the averaged $p_{space}$ is 0.162. There is a mild upward trend for
$p_{space}$ in US presidential speeches, but such a temporal pattern is missing in Mexican texts.

When the "space" is considered as a symbol, its frequency is higher than that of any single
letter. The rank-frequency plot with the space symbol can still be fit perfectly by the Cocho/Beta
function (result not shown), and the Cocho/Beta function still outperforms the other functions.
However, the fitted coefficient values can be quite different when the space symbol is included.
For example, for English texts, a = 0.21 and b = 1.35 without the space, but a = 0.50 and
b = 0.875 with the space symbol.

Due to the limited range of abscissa, many functions seem to fit the ranked letter frequency
distribution very well, and any subtle change might disturb the relative performance
among fitting functions. Take the −log(r) type functions for example: we have considered
four similar functions already, Eq.(1), Eq.(4), Eq.(5), and Eq.(6). The quadratic logarithmic
function Eq.(6) clearly outperforms Eq.(1) and Eq.(4), and competes with the Cocho/Beta function
to become the best fitting function. We notice that in Gusein-Zade's original publication
(Gusein-Zade, 1988), he proposed a function of the form f = (1/r + 1/(r+1) + ... + 1/n)/n, which
also appeared in (Gamow and Ycas, 1955) after a random division of unit length problem by
John von Neumann. That function can be approximated by a −log(r) function.
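
The closeness of the harmonic-sum form to a −log(r) function can be checked numerically. In the sketch below, the function names and the particular approximation log((n+1)/r)/n, obtained by replacing the sum of 1/k with an integral of 1/x from r to n+1, are our own choices for illustration.

```python
import math


def harmonic_sum_form(n):
    """f_r = (1/r + 1/(r+1) + ... + 1/n) / n for r = 1..n."""
    freqs = []
    tail = 0.0
    for r in range(n, 0, -1):   # accumulate the tail sum from r = n down to 1
        tail += 1.0 / r
        freqs.append(tail / n)
    freqs.reverse()
    return freqs


def log_approximation(n):
    """Integral approximation: sum_{k=r}^{n} 1/k is about log((n + 1) / r)."""
    return [math.log((n + 1) / r) / n for r in range(1, n + 1)]


n = 26
exact = harmonic_sum_form(n)
approx = log_approximation(n)
largest_gap = max(e - a for e, a in zip(exact, approx))
print(sum(exact), largest_gap)
```

The harmonic-sum frequencies add up to one by construction, and the integral form always lies slightly below them, with the largest gap at rank 1.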

The piecewise plateaus revealed by Fig. 7 seem to partition the alphabet into discrete groups.
For English, the rare letters k, x, j, q, z form their own group, with frequencies much lower than
expected from the log((n+1)/r) function. For Spanish, besides the well-known letter group of
k, w, we found another group with letters b, g, y, v, f, q, h, j, z, x, ñ. The height of the first plateau
is about twice that of the second plateau, for both English and Spanish. One hypothesis is
that these lower-than-expected rare letters were originally paired as one letter, then each
ancestral letter was split into two letters. Two such pairs can be imagined for English (discard
z), and five pairs for Spanish (discard ñ).

Of the ten functions used in this paper, some explicitly include the number of letters, n,
as part of the modeling, whereas others do not. Those with n include Gusein-Zade, Weibull,
and Cocho/Beta. For some linguistic data, the value of n is fundamentally undecided, for
example, the number of words in a language. It is argued that word distribution should be
better modeled by "large number of rare events" (LNRE) (Baayen, 2001). One consequence of
LNRE is that the number of words n increases with the text length (following Heaps' law
(Heaps, 1978)), making the value of n uncertain. Fortunately, in letter frequencies, the value
of n is independent of the text length.

There might be deeper reasons why Cocho/Beta outperforms the nine other functions in fitting
our data. It was suggested that when a new random variable is constructed by allowing
both addition and subtraction of independent and identically distributed random variables,
but within a certain range, the new random variable follows the Cocho/Beta distribution
(Beltran del Rio et al., 2010). Perhaps the Cocho/Beta function is a limiting functional form for
ranked data under a very general condition.

In conclusion, we use ten functions to fit the English and Spanish ranked letter frequency
distributions obtained from the US and Mexican presidential speeches. The Cocho/Beta function is
the best fitting function among the ten, judged by the sum of squared errors (SSE) and the Akaike
information criterion (AIC). The quadratic logarithmic function is a close second best. We also discover
a grouping of letters in both English and Spanish. The rarer-than-expected group in English
consists of two pairs of letters, whereas that in Spanish consists of five pairs. There is a third,
even-rarer-than-expected letter group in Spanish with k, w, consistent with the fact that these
are only used for foreign words. Besides the Cocho/Beta and quadratic logarithmic functions, it
is not conclusive whether other functions follow a universal relative fitting performance order.
Studying letter frequencies in other languages could potentially answer this
question.



Appendix: Relationship between AIC/BIC and SSE 

The Akaike information criterion is defined as AIC = −2 log L + 2p, where L is the maximized
likelihood and p is the number of parameters in the statistical model. When a dataset is fitted by a
model, if the error is normally distributed, the likelihood of the model is (n is the number of
samples, $\sigma$ is the standard deviation of the normal distribution for the error, $\{y_i\}$ are the data
points, and $\{\hat{y}_i\}$ are the fitted values):

\[ L = \prod_{i=1}^{n} \frac{e^{-(y_i - \hat{y}_i)^2 / 2\sigma^2}}{\sqrt{2\pi\sigma^2}} = \frac{e^{-\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / 2\sigma^2}}{(2\pi\sigma^2)^{n/2}} \qquad (14) \]

The $\sum_i (y_i - \hat{y}_i)^2$ term can be called SSE (sum of squared errors).
If the error variance is unknown, it can be estimated from the data:

\[ \hat{\sigma}^2 = \frac{\mathrm{SSE}}{n} \qquad (15) \]

Replacing $\sigma$ by the estimated $\hat{\sigma}$, we obtain the maximized likelihood, which after taking the
logarithm is (Venables and Ripley, 2002):

\[ \log(\hat{L}) = C - \frac{n}{2} \log(\hat{\sigma}^2) = C - \frac{n}{2} \log(\mathrm{SSE}/n) \qquad (16) \]

then,

\[ \mathrm{AIC} = n \log(\mathrm{SSE}/n) + 2p + \mathrm{const.} \qquad (17) \]

and

\[ \mathrm{BIC} = n \log(\mathrm{SSE}/n) + \log(n) \, p + \mathrm{const.} \qquad (18) \]
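
Eqs.(17) and (18) make it easy to recompute the AIC/BIC differences in Table 3 from the SSE values alone, because the constant cancels when two models are compared on the same data. The sketch below uses the English SSE values from Table 3 with n = 26. The function names are ours, and taking p = 0 for the Gusein-Zade function follows the "zero-free-parameter" description in the text, so it should be read as an assumption.

```python
import math


def aic_from_sse(sse, n, p):
    """AIC up to an additive constant, Eq.(17)."""
    return n * math.log(sse / n) + 2 * p


def bic_from_sse(sse, n, p):
    """BIC up to an additive constant, Eq.(18)."""
    return n * math.log(sse / n) + math.log(n) * p


n = 26  # number of English letters, i.e. number of fitted data points
# (SSE, number of free parameters p) for three of the functions in Table 3.
models = {
    "Gusein-Zade": (0.00106, 0),
    "Cocho/Beta": (0.000420, 2),
    "Frappat": (0.000587, 3),
}
best_aic = aic_from_sse(*models["Cocho/Beta"][:1], n, models["Cocho/Beta"][1])
best_bic = bic_from_sse(*models["Cocho/Beta"][:1], n, models["Cocho/Beta"][1])
delta_aic = {name: aic_from_sse(sse, n, p) - best_aic
             for name, (sse, p) in models.items()}
delta_bic = {name: bic_from_sse(sse, n, p) - best_bic
             for name, (sse, p) in models.items()}
for name in models:
    print(name, round(delta_aic[name], 2), round(delta_bic[name], 2))
```

Because the tabulated SSE values are rounded to three significant digits, the recomputed differences agree with Table 3 only to within rounding.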